The aim of this project is to investigate how the performance of elite runners is linked to factors such as height, weight and age. This will be done by analysing historical data on athlete performances, mainly from modern Olympic running events. Understanding the relationship between these characteristics and athlete performance would help in developing training plans and improving athlete performance in the future.
Athlete performance is clearly affected by many factors, and this analysis will be limited to just a few of them, dictated mainly by the data available. The specific questions addressed here are:
There will be no particular modelling or machine learning in this analysis because the questions can be answered by visualising the statistics alone.
# Import libraries
import pandas as pd
import chardet # For character encoding
import ftfy # For fixing encoding issues
from matplotlib import pyplot as plt
from matplotlib import pylab as plb # For best fit lines
from datetime import datetime, time, timedelta, date
import numpy as np
from fuzzywuzzy import fuzz # For inexact ("fuzzy") string matching
from fuzzywuzzy import process
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters() # Future compatibility plotting datetime
This analysis will attempt some originality by combining three separate data sets. This allows athlete characteristics to be linked to athlete performances so any relationship between the two can be investigated.
First, load the data sets and briefly examine them.
The first data set is the Olympic Games results and athlete data, 1896-2016.
Source:
# Results data is in the first file:
all_olympics = pd.read_csv('datasets/athlete_events.csv')
all_olympics.head()
all_olympics.info()
Summary
The full Olympic Games data set contains useful information about the competitors over a 120-year period (height, weight, age, country of origin, and medal, if they won one).
The second data set contains the Olympic track and field times and results. Note it only includes data for medal winners. Source:
# Data set 2 - Olympic track and field times and results. Source:
# www.kaggle.com/jayrav13/olympic-track-field-results/downloads/olympic-track-field-results.zip/1
# There is an additional unlabelled column in a few of the rows.
# Therefore, read explicitly labelled columns and discard unlabelled column.
ol_tf = pd.read_csv('datasets/results.csv', names=['Gender',
'Event',
'Location',
'Year',
'Medal',
'Name',
'Nationality',
'Result'])
ol_tf.drop(index=0, inplace=True)
ol_tf.head()
ol_tf.info()
Summary
The most useful feature of the track and field results is the detailed running times and event results. This will be linked to the full Olympic data (including its information on the athletes' characteristics) later in the analysis.
The third data set contains the top 1000 running performances for each running event.
Source:
https://www.kaggle.com/jguerreiro/running/downloads/running.zip/2
top_running = pd.read_csv('datasets/data.csv')
top_running.head()
top_running.info()
Summary
This data set is good because it contains a large number of data points (1000) including finish times for every running discipline. It is not limited to Olympic performances, but all the events are Olympic distances, with the exception of the half marathon.
print("Number of unique events is {}"
.format(len(all_olympics['Event'].unique())))
765 events is far too many to analyse. It also includes some events which have not taken place in the Olympics for a long time. This analysis is focussed on modern running events, so we will extract a subset of the results.
olympic_sports_groups = all_olympics.groupby('Sport')
athletics = olympic_sports_groups.get_group('Athletics')
all_athletics_events = athletics['Event'].unique()
all_athletics_events
This is a more manageable list of events. There are still some events here that don't exist in the modern Games. The next step is to remove any events that didn't take place in the most recent summer Games (2016).
modern_athletics_events = athletics[
athletics['Year'] == 2016]['Event'].unique()
modern_athletics_events
removed_events = set(all_athletics_events).difference(modern_athletics_events)
removed_events
indices_to_remove = [athletics.index[i] for i in range(len(athletics)) if
athletics['Event'].iloc[i] in removed_events]
modern_athletics = athletics.drop(index=indices_to_remove)
modern_athletics['Event'].unique()
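The same filter can also be written as a boolean mask with `Series.isin`, which avoids building a list of row indices to drop. The sketch below is only an illustration of that alternative: `athletics_toy` and its rows are invented for the example and are not taken from the real data set.

```python
import pandas as pd

# Toy stand-in for the athletics frame; these rows are invented for illustration.
athletics_toy = pd.DataFrame({
    'Event': ["Athletics Men's 100 metres",
              "Athletics Men's 5 mile",
              "Athletics Men's 100 metres",
              "Athletics Women's Marathon"],
    'Year': [1908, 1908, 2016, 2016],
})

# Events that were contested at the most recent Games in the toy data.
modern_events = athletics_toy.loc[
    athletics_toy['Year'] == 2016, 'Event'].unique()

# Keep every row, from any year, whose event is still contested.
modern_toy = athletics_toy[athletics_toy['Event'].isin(modern_events)]
print(modern_toy)
```

The 1908 row for the 100 metres is kept because the event still exists, while the 5 mile row is dropped.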
This analysis will focus on individual running events. So, now remove the field events and non-running events.
# These are the events to keep for the analysis.
modern_individual_running_events = {
"Athletics Women's 100 metres",
"Athletics Men's 1,500 metres",
"Athletics Men's 5,000 metres",
"Athletics Men's 110 metres Hurdles",
"Athletics Women's Marathon",
"Athletics Men's 100 metres",
"Athletics Men's 400 metres Hurdles",
"Athletics Men's 400 metres",
"Athletics Men's 800 metres",
"Athletics Men's Marathon",
"Athletics Men's 10,000 metres",
"Athletics Men's 200 metres",
"Athletics Men's 3,000 metres Steeplechase",
"Athletics Women's 200 metres",
"Athletics Women's 5,000 metres",
"Athletics Women's 10,000 metres",
"Athletics Women's 1,500 metres",
"Athletics Women's 800 metres",
"Athletics Women's 400 metres",
"Athletics Women's 400 metres Hurdles",
"Athletics Women's 100 metres Hurdles",
"Athletics Women's 3,000 metres Steeplechase"}
removed_events = set(modern_athletics_events).difference(
modern_individual_running_events)
removed_events
indices_to_remove = [modern_athletics.index[i]
for i in range(len(modern_athletics)) if
modern_athletics['Event'].iloc[i] in removed_events]
ol_running = modern_athletics.drop(index=indices_to_remove)
ol_running.head()
# Check for missing values in each column.
ol_running.isnull().sum()
Many rows have no entry for a medal, and this is expected: most competitors do not win a medal, so no special treatment is needed for missing values in the Medal feature. There are also a lot of missing values for height, weight and age. These will be examined now.
age_missing = ol_running[ol_running['Age'].isnull()]
weight_missing = ol_running[ol_running['Weight'].isnull()]
height_missing = ol_running[ol_running['Height'].isnull()]
age_missing.head()
weight_missing.head()
height_missing.head()
We now have three groups of rows that have at least one missing value. Now find out if they overlap by using sets:
age_missing_indices = set(age_missing.index)
weight_missing_indices = set(weight_missing.index)
height_missing_indices = set(height_missing.index)
print("The number of rows where both height and weight are missing is {}"
.format(len(weight_missing_indices.intersection(
height_missing_indices))))
print("The number of rows where both age and weight are missing is {}"
.format(len(age_missing_indices.intersection(weight_missing_indices))))
print("The number of rows where both age and height are missing is {}"
.format(len(age_missing_indices.intersection(height_missing_indices))))
print("The number of rows where age, height and weight are missing is {}"
.format(len(age_missing_indices.intersection(
height_missing_indices,
weight_missing_indices))))
Of the rows where height (2987) or weight (3131) is missing, most (2961) are missing both. Of the rows where age is missing (667), most (at least 504) are also missing weight, height or both. This large overlap is good news: the missing values are concentrated in relatively few rows, so more of the rows are fully populated and usable without dropping data or imputation. For now, all the data will be kept (rows with missing data are not dropped).
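The overlap counts can also be obtained directly from boolean masks rather than from sets of indices. This is a minimal sketch on an invented four-row frame (`toy` is hypothetical and its values are not from the Olympic data); it only illustrates the idea of counting rows by their pattern of missing values.

```python
import numpy as np
import pandas as pd

# Invented example rows; NaN marks a missing value.
toy = pd.DataFrame({
    'Age':    [24, np.nan, 30, np.nan],
    'Height': [180, np.nan, np.nan, 175],
    'Weight': [70, np.nan, np.nan, 68],
})

# True wherever a value is missing.
missing = toy[['Age', 'Height', 'Weight']].isnull()

# Rows missing both height and weight.
both_hw = int((missing['Height'] & missing['Weight']).sum())
# Rows missing all three characteristics.
all_three = int(missing.all(axis=1).sum())
# Rows with no missing characteristics at all.
complete_rows = int((~missing.any(axis=1)).sum())

print(both_hw, all_three, complete_rows)
```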
The Event feature is a categorical variable. This will be encoded as follows:
This method of encoding is chosen because it groups together similar types of events (e.g., hurdles events are treated as a group, flat track events are treated as a separate group) and also separates them by the distance of each event (100m, 200m, etc.)
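As an illustration of the target layout, the sketch below encodes a single cleaned event label into one entry per event type, holding the distance in metres and 0 elsewhere. The column names and the marathon distance of 42195 are taken from the code that follows; the function `encode_event` itself is a hypothetical helper written for this example and is not part of the analysis.

```python
import re


def encode_event(label):
    """Return one distance-valued entry per event type (0 elsewhere)."""
    encoded = {'Track_Flat': 0, 'Hurdles': 0, 'Road': 0, 'Steeplechase': 0}
    if label == 'Marathon':
        # The marathon is the only road event; store its distance in metres.
        encoded['Road'] = 42195
        return encoded
    # A distance, optionally followed by the event type.
    match = re.fullmatch(r'(\d+)(?: (Hurdles|Steeplechase))?', label)
    if match is None:
        raise ValueError("Unrecognised event label: {}".format(label))
    distance, kind = match.groups()
    # No explicit type means a flat track event.
    encoded[kind or 'Track_Flat'] = int(distance)
    return encoded


for label in ['100', '110 Hurdles', '3000 Steeplechase', 'Marathon']:
    print(label, encode_event(label))
```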
# Simple string processing in Event column
ol_running['Event'] = ol_running['Event'].str.replace("Athletics Women's ", "")
ol_running['Event'] = ol_running['Event'].str.replace("Athletics Men's ", "")
ol_running['Event'] = ol_running['Event'].str.replace(" metres", "")
ol_running['Event'] = ol_running['Event'].str.replace(",", "")
ol_running.head()
Adding the new columns, copying the values between columns and removing duplicates is repetitive, so write a function for this:
def encode_events(df, col, to_replace, replacement):
    """
    Helper function to insert new columns,
    copy and convert values to the correct column
    """
    # Insert new column
    df.insert(df.columns.get_loc('Event'), col, 0)
    # Copy values across to new column
    df.loc[df['Event'].str.contains(to_replace), col] = df[
        'Event'].str.replace(to_replace, replacement)
    # Remove values from original column
    df.loc[df[col] != 0, 'Event'] = '0'


def string_to_int(df, features):
    """
    Helper function to cast string values to integers.
    """
    for feature in features:
        df[feature] = pd.to_numeric(df[feature], downcast='integer')
new_columns = ['Hurdles', 'Road', 'Steeplechase']
to_replace = [' Hurdles', 'Marathon', ' Steeplechase']
replacement = ['', '42195', '']
for i in range(len(new_columns)):
encode_events(ol_running, new_columns[i], to_replace[i], replacement[i])
ol_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)
# Several features now contain strings that would be easier to use as integers.
# Convert these to integers now.
columns_to_int = ['Track_Flat', 'Hurdles', 'Steeplechase', 'Road', 'Year']
string_to_int(ol_running, columns_to_int)
The other two data sets call the athlete's sex 'Gender'. For ease of comparison, rename this feature from 'Sex' to 'Gender'.
ol_running.rename(columns={'Sex': 'Gender'}, inplace=True)
For ease of comparison with the other data sets, convert 'Gold' to 'G', 'Silver' to 'S', and 'Bronze' to 'B'.
# Map the full medal names to one-letter codes; missing values are left as-is.
short_medals = {'Gold': 'G', 'Silver': 'S', 'Bronze': 'B'}
ol_running['Medal'] = ol_running['Medal'].replace(short_medals)
The 'Name' column also looks difficult to use:
ol_running['Name'].head()
There are alternative names/nicknames in parentheses and double quotes. The intention is to use the names later on, so to make this easier, remove sections in parentheses and double quotes, and convert the name string to lowercase. Make this a function so it can be used on the other data sets later on.
def process_names(df):
    """
    Helper function to perform some cleaning on the athlete Name field.
    """
    df.rename(columns={'Name': 'RawName'}, inplace=True)
    df.insert(loc=df.columns.get_loc('RawName'), column='Name', value=np.nan)
    # regex=True is needed: pandas 2.0+ treats the pattern as a literal by default
    df['Name'] = df['RawName'].str.replace(r'"(.*?)"', '', regex=True)
    df['Name'] = df['Name'].str.replace(r'\((.*?)\)', '', regex=True)
    df['Name'] = df['Name'].str.lower()
process_names(ol_running)
ol_running.head()
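To make the effect of the two patterns concrete, here is a small standalone sketch that applies the same regular expressions to a single string with the standard `re` module. The sample name is invented, and `clean_name` is a hypothetical helper; unlike `process_names`, it also collapses the leftover whitespace, which is an addition made only for this example.

```python
import re


def clean_name(raw_name):
    """Strip quoted nicknames and parenthesised parts, then lowercase."""
    # Remove anything in double quotes, e.g. a nickname.
    without_quotes = re.sub(r'"(.*?)"', '', raw_name)
    # Remove anything in parentheses, e.g. an alternative surname.
    without_parens = re.sub(r'\((.*?)\)', '', without_quotes)
    # Lowercase and collapse runs of whitespace left behind by the removals.
    return ' '.join(without_parens.lower().split())


print(clean_name('Jane "JJ" Doe (-Smith)'))
```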
Wrangling of this data set is complete, and from here on the cleaned data frame will always be called ol_running.
print("Number of unique events is {}".format(len(ol_tf['Event'].unique())))
ol_tf.head()
all_ol_tf_events = ol_tf['Event'].unique()
all_ol_tf_events
As in the previous section, this analysis will keep the individual running events and drop the remainder.
ol_tf_running_events = {
'10000M Men',
'100M Men',
'110M Hurdles Men',
'1500M Men',
'200M Men',
'3000M Steeplechase Men',
'400M Hurdles Men',
'400M Men',
'5000M Men',
'800M Men',
'Marathon Men',
'10000M Women',
'100M Hurdles Women',
'100M Women',
'1500M Women',
'200M Women',
'3000M Steeplechase Women',
'400M Hurdles Women',
'400M Women',
'5000M Women',
'800M Women',
'Marathon Women'}
indices_to_remove = [ol_tf.index[i] for i in range(len(ol_tf))
if not ol_tf['Event'].iloc[i] in ol_tf_running_events]
ol_tf_running = ol_tf.drop(index=indices_to_remove)
ol_tf_running.head()
ol_tf_running['Event'].unique()
In addition, this data set contains results for a Men's 3000m steeplechase in 1900 and 1904. However, this is an error in the data: the 1900 and 1904 Olympics featured shorter steeplechase distances (source: https://en.wikipedia.org/wiki/Steeplechase_(athletics)). Therefore the rows for 3000M Steeplechase Men for 1900 and 1904 will be removed.
drop_steeplechase = ol_tf_running[
((ol_tf_running['Year'] == '1900') |
(ol_tf_running['Year'] == '1904')) &
(ol_tf_running['Event'] == '3000M Steeplechase Men')].index.tolist()
ol_tf_running.drop(index=drop_steeplechase, inplace=True)
This now contains the data of interest.
# Check for missing values in each column.
ol_tf_running.isnull().sum()
No missing values are shown, but this is deceptive, since some of the 'Result' fields contain the string 'None'.
ol_tf_running[ol_tf_running['Result'] == 'None'].head()
ol_tf_running.loc[ol_tf_running['Result'] == 'None', 'Result'] = pd.NaT
ol_tf_running.dropna(subset=['Result'], inplace=True)
The same approach will be used as in the previous section so that the data sets end up with a consistent set of labels for each event.
# Simple string processing in Event column
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("Women", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("Men", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("M ", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace(",", "")
new_columns = ['Hurdles', 'Road', 'Steeplechase']
to_replace = ['Hurdles', 'Marathon', 'Steeplechase']
replacement = ['', '42195', '']
for i in range(len(new_columns)):
encode_events(ol_tf_running, new_columns[i], to_replace[i], replacement[i])
ol_tf_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)
The aim is to convert the Results string to a datetime object, extract the time from this and store it in a feature called 'Time'. The time formats vary a lot in this data set so some cleaning is needed.
It's possible to create general groups of events that share similar formats.
# Hurdle events
ol_tf_running_hurdles_groups = ol_tf_running.groupby('Hurdles')
# Road running events
ol_tf_running_road_groups = ol_tf_running.groupby('Road')
# Steeplechase
ol_tf_running_steeplechase_groups = ol_tf_running.groupby('Steeplechase')
# Track (flat) events
ol_tf_running_trackf_groups = ol_tf_running.groupby('Track_Flat')
event_groups = [ol_tf_running_hurdles_groups,
ol_tf_running_road_groups,
ol_tf_running_steeplechase_groups,
ol_tf_running_trackf_groups]
for group in event_groups:
    # Ignore the first event in each category where distance=0
    for event in list(group.groups.keys())[1:]:
        print("Event: {}".format(event))
        print(group.get_group(event)['Result'].head(3))
This shows it's possible to define three time formats in this result set:
# Time format for the sprint events
time_format_sprints = '%S.%f'
# Time format for middle distance events
time_format_middle = '%M:%S.%f'
# Time format for long distance events
time_format_long = '%H:%M:%S'
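As a quick check of what each format accepts, the sketch below parses one sample string per format with `datetime.strptime` and keeps only the time part. The three sample results are invented for illustration and are not rows from the data set.

```python
from datetime import datetime

# The three formats defined above.
time_formats = {
    'sprints': '%S.%f',
    'middle': '%M:%S.%f',
    'long': '%H:%M:%S',
}

# Invented sample results, one per format.
samples = {
    'sprints': '9.81',
    'middle': '3:50.00',
    'long': '2:08:44',
}

# Parse each sample and keep only the time-of-day part.
parsed = {
    key: datetime.strptime(samples[key], time_formats[key]).time()
    for key in samples
}

for key, value in parsed.items():
    print(key, value)
```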
Examining each event in more detail shows that some further processing is needed.
Steeplechase
ol_tf_running_steeplechase_groups.get_group('3000 ')['Result'].head()
# Convert to datetime and extract the time part only.
ol_tf_running.loc[
ol_tf_running['Steeplechase'] == '3000 ', 'Time'] = pd.to_datetime(
ol_tf_running[ol_tf_running['Steeplechase'] == '3000 ']['Result'],
format=time_format_middle).apply(datetime.time)
Hurdles
ol_tf_running_hurdles_groups.get_group('100 ')['Result'].head()
ol_tf_running_hurdles_groups.get_group('110 ')['Result'].head()
ol_tf_running_hurdles_groups.get_group('400 ')['Result'].head()
In addition, some of the time strings have a leading '0:':
ol_tf_running[ol_tf_running['Result'] == '0:54.0']
# Remove leading '0:':
ol_tf_running.loc[
ol_tf_running['Hurdles'] == '400 ', 'Result'] = ol_tf_running[
ol_tf_running['Hurdles'] == '400 ']['Result'].str.replace('0:', '')
# For all Hurdles distances convert to datetime and extract time part.
events = list(ol_tf_running_hurdles_groups.groups.keys())
# Ignore the first event in each category where distance=0
events.remove(0)
for event in events:
ol_tf_running.loc[
ol_tf_running['Hurdles'] == event, 'Time'] = pd.to_datetime(
ol_tf_running[ol_tf_running['Hurdles'] == event]['Result'],
format=time_format_sprints).apply(datetime.time)
Track (Flat)
ol_tf_running_trackf_groups.get_group('100')['Result'].head()
ol_tf_running_trackf_groups.get_group('200')['Result'].head()
ol_tf_running_trackf_groups.get_group('400')['Result'].head()
ol_tf_running_trackf_groups.get_group('800')['Result'].head()
ol_tf_running_trackf_groups.get_group('1500')['Result'].head()
ol_tf_running_trackf_groups.get_group('5000')['Result'].head()
ol_tf_running_trackf_groups.get_group('10000')['Result'].head()
Track events for distances less than 800m all have times written in the format defined in time_format_sprints. 800m and above use the format defined in time_format_middle.
sprint_distances = ['100', '200', '400']
middle_distances = ['800', '1500', '5000', '10000']
As with the hurdles distances above, remove any leading '0:':
# Remove leading '0:':
for event in sprint_distances:
ol_tf_running.loc[
ol_tf_running['Track_Flat'] == event, 'Result'] = ol_tf_running[
ol_tf_running['Track_Flat'] == event]['Result'].str.replace('0:', '')
# For track sprint events convert to datetime and extract time part only.
for event in sprint_distances:
ol_tf_running.loc[
ol_tf_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
ol_tf_running[ol_tf_running['Track_Flat'] == event]['Result'],
format=time_format_sprints).apply(datetime.time)
# For track middle distance events convert to datetime and extract time.
for event in middle_distances:
ol_tf_running.loc[
ol_tf_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
ol_tf_running[ol_tf_running['Track_Flat'] == event]['Result'],
format=time_format_middle).apply(datetime.time)
Road
ol_tf_running_road_groups.get_group('42195 ').head()
Some specific examples show there are several problems:
ol_tf_running_road_groups.get_group('42195 ').loc[[1379]]
ol_tf_running_road_groups.get_group('42195 ').loc[[1392]]
ol_tf_running_road_groups.get_group('42195 ').loc[[1417]]
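This shows several formatting problems: an 'h' used as the hours separator, fractional seconds appended to some times, and '-' used in place of ':'.
Taking these in turn: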
# Remove 'h'
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[
ol_tf_running['Road'] == '42195 ']['Result'].str.replace('h', ':')
# Remove fractional seconds (regex=True is needed in pandas 2.0+):
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[
ol_tf_running['Road'] == '42195 ']['Result'].str.replace(
r'\..*', '', regex=True)
# Replace '-' with ':'
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[
ol_tf_running['Road'] == '42195 ']['Result'].str.replace('-', ':')
ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'].head()
There are also some values that only include hours and minutes:
ol_tf_running[ol_tf_running['Result'] == '2:32']
for i in ol_tf_running[ol_tf_running['Road'] == '42195 '].index:
    # Use a single .loc call to avoid chained assignment on a copy
    if len(ol_tf_running.loc[i, 'Result'].split(':')) < 3:
        ol_tf_running.loc[i, 'Result'] = ol_tf_running.loc[i, 'Result'] + ':00'
# Convert to datetime and extract the time part only.
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Time'] = pd.to_datetime(
ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'],
format=time_format_long).apply(datetime.time)
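The marathon clean-up steps can be summarised as a single function applied to one result string at a time. This is a sketch under assumptions: `normalise_marathon_result` is a hypothetical helper, and the sample strings are invented to resemble the formatting problems handled above rather than copied from the data.

```python
import re
from datetime import datetime


def normalise_marathon_result(result):
    """Bring a marathon result string into H:M:S form and parse it."""
    # Replace an 'h' hours separator with ':'.
    cleaned = result.replace('h', ':')
    # Drop any fractional seconds.
    cleaned = re.sub(r'\..*', '', cleaned)
    # Replace '-' separators with ':'.
    cleaned = cleaned.replace('-', ':')
    # Pad values that only contain hours and minutes.
    if len(cleaned.split(':')) < 3:
        cleaned += ':00'
    return datetime.strptime(cleaned, '%H:%M:%S').time()


# Invented examples of each formatting problem.
for raw in ['2h32:35.8', '2-26-42', '2:32']:
    print(raw, '->', normalise_marathon_result(raw))
```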
# Several features now contain strings that would be easier to use as integers.
# Convert these to integers now.
columns_to_int = ['Track_Flat', 'Hurdles', 'Steeplechase', 'Road', 'Year']
string_to_int(ol_tf_running, columns_to_int)
Some names include nicknames in double quotes. There is also a string encoding problem causing some characters to be displayed wrongly. For example, 'Emil ZÃTOPEK' and 'Katrin DÃRRE' below:
ol_tf_running['Name'].loc[25]
ol_tf_running['Name'].loc[2322]
# Check encoding of the file
with open("datasets/results.csv", 'rb') as file:
    print(chardet.detect(file.read()))
chardet still reports that the file is UTF-8 encoded, so the ftfy package is used to repair the bad encodings (reference for ftfy: https://ftfy.readthedocs.io/en/latest/).
ol_tf_running['Name'] = ol_tf_running['Name'].apply(ftfy.fix_encoding)
ol_tf_running['Name'].loc[25]
ol_tf_running['Name'].loc[2322]
The bad encodings have disappeared.
Other name text processing is the same as in the previous section.
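For intuition about how this kind of garbling arises, the standard-library sketch below reproduces it by decoding UTF-8 bytes as Latin-1, then reverses the mistake. This covers only the simplest case and is not how ftfy works internally; ftfy detects and repairs many more situations automatically.

```python
# A correctly encoded name.
original = 'Emil Z\u00c1TOPEK'

# Decoding the UTF-8 bytes as Latin-1 turns the accented letter into
# 'Ã' followed by an invisible control character.
garbled = original.encode('utf-8').decode('latin-1')

# Reversing the two steps recovers the original text.
repaired = garbled.encode('latin-1').decode('utf-8')

print(repr(garbled))
print(repaired)
```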
# Use the processing function defined previously
process_names(ol_tf_running)
ol_tf_running['Name'].head()
Female athletes are categorised as 'W' in the 'Gender' column. Change this to be 'F' for consistency with the other data sets.
ol_tf_running['Gender'] = ol_tf_running['Gender'].str.replace('W', 'F')
This concludes cleaning of the second data set, which will be named ol_tf_running from here on.
Select individual running events as with the previous two data sets.
print("Number of unique events is {}"
.format(len(top_running['Event'].unique())))
top_running['Event'].unique()
These are all valid events for this analysis. No need to remove any.
top_running.isnull().sum()
There are a few missing 'Place' values. This analysis does not use that feature, so no further action is needed for now.
top_running.head()
The same approach will be used as in the previous section so that the data sets end up with a consistent set of labels for each event.
# Simple string processing in Event column
# Replace strings ('Marathon', 'Half marathon') with distance in metres:
racetype = ['Marathon', 'Half marathon']
distance = ['42195 Road', '21098 Road']
for i in range(len(racetype)):
top_running.loc[
top_running['Event'] == racetype[i], 'Event'] = top_running[
top_running['Event'] == racetype[i]]['Event'].str.replace(
racetype[i],
distance[i])
top_running['Event'] = top_running['Event'].str.replace(",", "")
new_columns = ['Road']
to_replace = [' Road']
replacement = ['']
for i in range(len(new_columns)):
encode_events(top_running, new_columns[i], to_replace[i], replacement[i])
top_running['Event'] = top_running['Event'].str.replace(" m", "")
top_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)
top_running.head()
# In column 'Date', year will be used as one of the keys to merge data sets.
# Therefore, create a separate 'Year' column and populate it.
top_running.insert(top_running.columns.get_loc('Date'), 'Year', 0)
top_running['Year'] = top_running['Date'].str.split("-", expand=True)[0]
# Convert strings in several features to integers now.
columns_to_int = ['Track_Flat', 'Road', 'Year']
string_to_int(top_running, columns_to_int)
Convert the string into a datetime object. First look at what the different time formats used in each event are:
# Road running events
top_running_road_groups = top_running.groupby('Road')
# Track (flat) events
top_running_trackf_groups = top_running.groupby('Track_Flat')
event_groups = [top_running_road_groups,
top_running_trackf_groups]
for group in event_groups:
# Ignore the first event in each category where distance=0
for event in list(group.groups.keys())[1:]:
print("Event: {}".format(event))
print(group.get_group(event)['Time'].head(3))
For the road running events, the time format is the same as already defined in time_format_long above. For the track events, most of the times have the same format ('%H:%M:%S.%f'), but there are occasional cases where the fractional seconds field is missing. For these cases it's possible to use the infer_datetime_format feature of pandas.to_datetime().
# Convert strings in the Time column to datetime objects.
# Convert to datetime and extract the time part only.
events = top_running['Road'].unique().tolist()
events.remove(0)
for event in events:
top_running.loc[
top_running['Road'] == event, 'Time'] = pd.to_datetime(
top_running[top_running['Road'] == event]['Time'],
format=time_format_long).apply(datetime.time)
# Convert strings in the Time column to datetime objects.
# Convert to datetime and extract the time part only.
events = top_running['Track_Flat'].unique().tolist()
events.remove(0)
for event in events:
top_running.loc[
top_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
top_running[top_running['Track_Flat'] == event]['Time'],
infer_datetime_format=True).apply(datetime.time)
Looking at the names in this data set - they seem straightforward:
top_running['Name'].head()
Other name text processing is the same as in the previous section.
# Use the processing function defined previously
process_names(top_running)
To be consistent with the other data sets, change the possible values of the 'Gender' feature to be either 'M' or 'F' instead of 'Men' or 'Women'.
# Vectorised mapping: 'Men' becomes 'M', everything else becomes 'F'
top_running['Gender'] = np.where(top_running['Gender'] == 'Men', 'M', 'F')
top_running.head()
Later in this analysis, times from this data set will be merged into the Olympic data set. To facilitate this, it is necessary to label which rows correspond to an Olympic Games. This will be done by comparing the 'Date' field of the result to the known dates of the Olympic Games.
# Convert dates to datetime format
top_running['Date'] = pd.to_datetime(
top_running['Date'], infer_datetime_format=True)
# What's the earliest year in the top_running data set?
min(top_running['Year'].tolist())
So there is no need to look at years before 1962.
# List of dates of Olympic summer games
# Source: https://en.wikipedia.org/wiki/Summer_Olympic_Games
# Use format Year-month-day
olympic_dates = [
['1964-10-10', '1964-10-24'],
['1968-10-12', '1968-10-27'],
['1972-08-26', '1972-09-10'],
['1976-07-17', '1976-08-01'],
['1980-07-19', '1980-08-03'],
['1984-07-28', '1984-08-12'],
['1988-09-17', '1988-10-02'],
['1992-07-25', '1992-08-09'],
['1996-07-19', '1996-08-04'],
['2000-09-15', '2000-10-01'],
['2004-08-13', '2004-08-29'],
['2008-08-08', '2008-08-24'],
['2012-07-27', '2012-08-12'],
['2016-08-05', '2016-08-21']
]
olympic_dates_df = pd.DataFrame(olympic_dates, columns=['Start', 'End'],
index=[1964,
1968,
1972,
1976,
1980,
1984,
1988,
1992,
1996,
2000,
2004,
2008,
2012,
2016])
olympic_dates_df['Start'] = pd.to_datetime(
olympic_dates_df['Start'], format='%Y-%m-%d')
olympic_dates_df['End'] = pd.to_datetime(
olympic_dates_df['End'], format='%Y-%m-%d')
top_running.insert(loc=top_running.columns.get_loc('Date'),
column='Olympics', value=False)
for y in olympic_dates_df.index:
    top_running.loc[
        top_running['Year'] == y, 'Olympics'] = (top_running['Year'] == y) & (
        top_running['Date'] >= olympic_dates_df.loc[y, 'Start']) & (
        top_running['Date'] <= olympic_dates_df.loc[y, 'End'])
top_running[top_running['Olympics']== True].head()
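The same labelling can be sketched without an explicit loop by looking up each row's Games window from its year and comparing the date against it. The two windows below are the 2012 and 2016 dates from the list above; the frame `toy` and its performance dates are invented for illustration.

```python
import pandas as pd

# Games windows for two years, taken from the list above.
games = pd.DataFrame(
    {'Start': pd.to_datetime(['2012-07-27', '2016-08-05']),
     'End': pd.to_datetime(['2012-08-12', '2016-08-21'])},
    index=[2012, 2016])

# Invented performance dates.
toy = pd.DataFrame({
    'Year': [2012, 2012, 2015, 2016],
    'Date': pd.to_datetime(
        ['2012-08-05', '2012-09-07', '2015-08-23', '2016-08-14']),
})

# Look up the window for each row's year; non-Olympic years give NaT.
start = toy['Year'].map(games['Start'])
end = toy['Year'].map(games['End'])

# Comparisons against NaT are False, so non-Olympic years are never flagged.
toy['Olympics'] = (toy['Date'] >= start) & (toy['Date'] <= end)
print(toy)
```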
It is useful to label the top 10 performances in each event, for each gender, and for every year. This is because the data set includes the top 1000 performances for all events, and since the main concern for this analysis is the factors affecting the improvement of performances, it is worth identifying the top 10 performances in each year.
top_running.insert(loc=top_running.columns.get_loc('Time'),
column='Top 10', value=False)
event_categories = ['Track_Flat', 'Road']
for gender in top_running['Gender'].unique().tolist():
    for category in event_categories:
        events = top_running[category].unique().tolist()
        events.remove(0)
        events.sort()
        for event in events:
            for year in top_running[
                    top_running[category] == event]['Year'].unique().tolist():
                best_times_per_year = top_running[
                    (top_running[category] == event) &
                    (top_running['Year'] == year) &
                    (top_running['Gender'] == gender)]['Time'].tolist()
                if len(best_times_per_year) > 0:
                    if len(best_times_per_year) >= 10:
                        cutoff = sorted(best_times_per_year)[9]
                    else:
                        cutoff = sorted(
                            best_times_per_year)[len(best_times_per_year)-1]
                    top_running.loc[
                        (top_running[category] == event) &
                        (top_running['Year'] == year) &
                        (top_running['Gender'] == gender),
                        'Top 10'] = top_running.loc[
                        (top_running[category] == event) &
                        (top_running['Year'] == year) &
                        (top_running['Gender'] == gender)]['Time'] <= cutoff
                else:
                    continue
Perform a check that this has worked:
top_running[(top_running['Road'] == 42195) & (
top_running['Top 10'] == True) & (
top_running['Year'] == 2012)].head()
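A more compact way to express "the best N per event, year and gender" is a grouped rank. The sketch below is an alternative illustration, not the method used above. It assumes times are stored as plain seconds in a hypothetical 'Seconds' column (ranking does not work on `datetime.time` objects), and it uses a cutoff of 2 on invented rows to keep the example small.

```python
import pandas as pd

# Invented performances; times are plain seconds for ranking purposes.
toy = pd.DataFrame({
    'Event': [100, 100, 100, 100, 200, 200],
    'Year': [2012, 2012, 2012, 2013, 2012, 2012],
    'Gender': ['M'] * 6,
    'Seconds': [9.9, 9.7, 9.8, 10.0, 19.9, 19.5],
})

top_n = 2

# Rank times within each (event, year, gender) group, fastest first.
# method='min' gives tied times the same rank, so ties at the cutoff
# are all kept, as with the '<= cutoff' comparison above.
rank = toy.groupby(['Event', 'Year', 'Gender'])['Seconds'].rank(method='min')
toy['Top N'] = rank <= top_n
print(toy)
```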
This concludes the processing of this data set, and the data frame will be named top_running from this point on.
The full Olympic data set has information about athlete characteristics but no times or results. Both the track and field data set and the top running times data set have times and results, but no athlete data. So to answer questions about how results and athlete characteristics are related, it is necessary to merge these data sets. Athletes often compete in multiple Olympic Games and in different events, so it will be necessary to find a match based on the year, the event, the medal awarded and the athlete's name. It will be straightforward to match the year across both data sets, and also the events and medals, because the labels are already standardised. The name presents an additional challenge because, for some athletes appearing in both data sets, it is written differently in each. For example, here is how Mo Farah's performance in the 10000 m in 2016 looks in the Olympic track and field data set:
ol_tf.loc[[1]]
Compare the way his name is written to the way it appears for the same performance in the full Olympic data set:
ol_running.loc[[66487]]
Row 66487 contains Mo Farah's performance matching the one in the Olympic track and field data, but the name is written very differently. To overcome this, a method called fuzzy matching will be used.
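To give a feel for fuzzy matching, the sketch below scores name similarity with `difflib.SequenceMatcher` from the Python standard library. This is a stand-in chosen so the example has no third-party dependencies; it is not the fuzzywuzzy scorer used in this analysis, and its scores will differ. The name strings are illustrative, and `similarity` and `best_match` are hypothetical helpers.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(name, candidates, threshold=50):
    """Return (candidate, score) for the closest candidate above threshold."""
    scored = [(similarity(name, candidate), candidate)
              for candidate in candidates]
    score, candidate = max(scored)
    return (candidate, score) if score > threshold else None


# Illustrative spellings of the same athlete in two data sets.
short_name = 'mohamed farah'
candidates = ['mohamed muktar jama farah', 'paul tanui']

print(best_match(short_name, candidates))
```

A result of `None` means no candidate was similar enough, which corresponds to leaving the time unmerged.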
The aim is to merge the time data into the ol_running data frame, where it is available.
ol_running.head()
Add two columns to ol_running, one for the time and one for the merged-in name, which can be used as a sanity check for the data merging process.
ol_running.insert(loc=len(ol_running.columns), column='Time', value=pd.NaT)
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
ol_running.insert(loc=ol_running.columns.get_loc('RawName'),
column='Merged_name', value=np.nan)
ol_running.insert(loc=ol_running.columns.get_loc('RawName'),
column='Ratio', value=np.nan)
Now define a function to merge the times from the ol_tf_running data set into the ol_running data set. This function splits the results in each data set into groups by event, year, gender and medal awarded. This cuts the full results set into much smaller and more manageable groups. Every pass of the loop examines a pair of corresponding groups, one from each data set, both covering the same event, year, gender and medal awarded. The function then compares the names in each. If the strings don't match in a simple way (using str.find()), it applies the fuzzy matching algorithm to find the best match (process.extractOne()). If the match ratio between the two strings being compared is above a threshold (chosen arbitrarily as 50), the two rows are treated as a match, and the time, name and ratio are saved in the ol_running data set.
def merge_times(df, event_categories, debug=False):
"""
Helper function to merge times from one dataset into the
ol_running dataset. Event, year, gender, medal and athlete name
are used as inputs to match athlete data from
one data frame to the same athlete's performance
in the other data frame.
Names are matched using fuzzy string matching.
Input parameters:
df - data frame to merge
event_categories - list of categories of events
debug - True/False flag switching debug on/off.
Returns:
None
"""
for category in event_categories:
if debug:
print(category)
ol_running_groups = ol_running.groupby(
[category, 'Year', 'Gender', 'Medal'])
df_groups = df.groupby([category, 'Year', 'Gender', 'Medal'])
events = ol_running[category].unique().tolist()
events.remove(0)
for event in events:
if debug:
print(event)
for gender in ol_running['Gender'].unique().tolist():
if debug:
print(gender)
for year in ol_running['Year'].unique().tolist():
if debug:
print(year)
for medal in ol_running['Medal'].unique().tolist():
if debug:
print(medal)
try:
group_1 = ol_running_groups.get_group(
(event, year, gender, medal))
except KeyError:
if debug:
print("No results in ol_running_groups")
continue
try:
group_2 = df_groups.get_group(
(event, year, gender, medal))
except KeyError:
if debug:
print("No results in df_groups")
continue
name_options = group_1['Name'].tolist()
for name in group_2['Name']:
find_result = group_1['Name'].str.find(name)
i = find_result[find_result > -1].index
if debug:
print(i)
# Test for a non-empty index: i.any() is False when the only label is 0
if len(i) > 0:
if debug:
print("str.find found a match: {}"
.format(name))
# Don't replace a time if one exists
if pd.isnull(
ol_running.loc[i]['Time'].tolist()):
ol_running.loc[i, 'Time'] = group_2.loc[
group_2['Name'] == name][
'Time'].tolist()
ol_running.loc[i, 'Merged_name'] = name
else:
if debug:
print("str.find did NOT find a match:")
best_match = process.extractOne(
name, name_options)
if debug:
print(best_match)
print("Best name: {}"
.format(best_match[0]))
print("Match confidence: {}"
.format(best_match[1]))
print("index={}"
.format(group_1[
group_1['Name'] == best_match[
0]].index))
if best_match[1] > 50:
i = group_1[
group_1['Name'] == best_match[0]].index
# Don't replace a time if one exists
if pd.isnull(
ol_running.loc[i][
'Time'].tolist()):
ol_running.loc[i,
'Merged_name'] = name
ol_running.loc[i,
'Ratio'] = best_match[1]
ol_running.loc[i,
'Time'] = group_2.loc[
group_2['Name'] == name][
'Time'].tolist()
event_categories = ['Track_Flat', 'Hurdles', 'Road', 'Steeplechase']
merge_times(ol_tf_running, event_categories)
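The two-stage name matching used above (substring search first, fuzzy match with a threshold second) can be sketched in isolation. This is a minimal illustration, not the notebook's implementation: it assumes the standard library's difflib.SequenceMatcher as a stand-in for fuzzywuzzy's process.extractOne(), so the scores will not be identical to those in the 'Ratio' column, and the helper names and athlete names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (difflib stand-in for fuzzywuzzy)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_name(name, candidates, threshold=50):
    """Return (matched_candidate, score), or (None, score) if nothing passes.

    Stage 1: plain substring match, mirroring the str.find() check.
    Stage 2: best fuzzy match, accepted only if it beats the threshold.
    """
    if not candidates:
        return None, 0.0
    for candidate in candidates:
        if name in candidate:
            return candidate, 100.0
    best = max(candidates, key=lambda c: similarity(name, c))
    score = similarity(name, best)
    if score > threshold:
        return best, score
    return None, score


# Hypothetical names, for illustration only
candidates = ["Maria Elena Rossi", "Anna Kowalska", "Jane Smith"]
print(match_name("Anna Kowalska", candidates))   # found by substring search
print(match_name("Maria Rossi", candidates))     # found by fuzzy matching
print(match_name("Maria Rossi", candidates, threshold=80))  # rejected
```

Raising the threshold trades missed matches for fewer false positives, which matters most when the candidate list is large.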
Check how well the fuzzy matching algorithm is doing:
ol_running.loc[~ol_running['Time'].isnull()].head()
From a visual scan of the three columns corresponding to athlete name, it looks like the fuzzy matching algorithm is doing a good job of finding the correct names. The matching algorithm uses a threshold value of 50 for the match ratio. As a further check, examine the matches with the lowest match ratio:
ol_running[ol_running['Ratio'] < 70]
Only seven rows have a match ratio below 70, and they all look correct, so the matching algorithm seems to be working well.
ol_running.loc[~ol_running['Time'].isnull()].info()
So this method has merged in 1177 time values. Next, merge in times from the top_running data frame. This is a little more complicated because it is necessary to group on events marked as True in the 'Olympics' feature of this data frame, to screen out other performances by the same athlete in the same year. In addition, some athletes may run several heats and a final in a single Games. Therefore, it is necessary to use the time from the race with the latest date within the period of the Games in question. If there turns out to be more than one such race (e.g., if a final and a heat were run on the same day), one is chosen arbitrarily. This is not a huge problem, since we are attempting to relate performances to height, weight and age, and those factors will not change within one day anyway.
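The "latest race within the Games" rule can be sketched as follows. This is an illustrative stand-alone version using plain tuples rather than the data frames; the function name, the explicit start and end dates, and the sample values are all assumptions made for the example (the notebook itself relies on the 'Olympics' flag rather than a date window).

```python
from datetime import date


def latest_race_time(races, games_start, games_end):
    """Return the time from the athlete's latest race inside the Games window.

    races is a list of (race_date, time_in_seconds) tuples. If several races
    share the latest date, the first one listed is returned, which is an
    arbitrary choice.
    """
    in_window = [r for r in races if games_start <= r[0] <= games_end]
    if not in_window:
        return None
    latest = max(race_date for race_date, _ in in_window)
    return next(t for race_date, t in in_window if race_date == latest)


# Made-up results: a heat, then a semi-final and final on the same day,
# and one later race outside the Games window.
races = [
    (date(2012, 8, 4), 10.20),
    (date(2012, 8, 5), 10.05),
    (date(2012, 8, 5), 9.98),
    (date(2012, 9, 1), 10.10),
]
print(latest_race_time(races, date(2012, 8, 3), date(2012, 8, 12)))
```

With two races on the final day, the sketch returns whichever appears first, so it may pick the semi-final rather than the final.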
The ol_tf_running data set contained only medal-winning performances, so it was (almost) guaranteed that there would be a corresponding row in the ol_running data set. The top_running data set differs from the ol_tf_running data set in that it contains many non-medal winning performances. Therefore, it's not possible to use the 'Medal' field to group the performances and use that to help match them. This means there is a larger scope for false positives, where the fuzzy matching algorithm wrongly identifies two similar names as a match. To help solve this, the match ratio threshold is raised from 50 to 80 in this function. It's not straightforward to combine this extra complexity into the existing merge_times function, so write a new function to handle this.
def merge_times_ext(df, event_categories, debug=False):
"""
Helper function to merge times from one dataset into
the ol_running dataset.
Event, year, gender, and athlete name are used as inputs
to match athlete data from one data frame to the same
athlete's performance in the other data frame.
Names are matched using fuzzy string matching.
Input parameters:
df - data frame to merge
event_categories - list of categories of events
debug - True/False flag to switch debug on/off.
Returns:
None
"""
for category in event_categories:
if debug:
print(category)
ol_running_groups = ol_running.groupby([category, 'Year', 'Gender'])
df_groups = df.groupby([category, 'Year', 'Gender', 'Olympics'])
events = ol_running[category].unique().tolist()
events.remove(0)
for event in events:
if debug:
print(event)
for gender in ol_running['Gender'].unique().tolist():
if debug:
print(gender)
for year in ol_running['Year'].unique().tolist():
if debug:
print(year)
try:
group_1 = ol_running_groups.get_group(
(event, year, gender))
except KeyError:
if debug:
print("No results in ol_running_groups")
continue
try:
group_2 = df_groups.get_group(
(event, year, gender, True))
except KeyError:
if debug:
print("No results in df_groups")
continue
name_options = group_1['Name'].tolist()
for name in group_2['Name']:
find_result = group_1['Name'].str.find(name)
i = find_result[find_result > -1].index
if debug:
print(i)
# Test for a non-empty index: i.any() is False when the only label is 0
if len(i) > 0:
if debug:
print("str.find found a match: {}"
.format(name))
# Don't replace a time if one exists
if pd.isnull(ol_running.loc[i]['Time'].tolist()):
latest_race_date = group_2.loc[
(group_2['Name'] == name, 'Date')].max()
ol_running.loc[i, 'Time'] = group_2.loc[
(group_2['Name'] == name) &
(group_2['Date'] == latest_race_date)][
'Time'].tolist()[0]
else:
if debug:
print("str.find did NOT find a match:")
best_match = process.extractOne(name, name_options)
if debug:
print(best_match)
print("Best name: {}"
.format(best_match[0]))
print("Match confidence: {}"
.format(best_match[1]))
print("index={}"
.format(group_1[group_1[
'Name'] == best_match[0]].index))
if best_match[1] > 80:
i = group_1[group_1[
'Name'] == best_match[0]].index
# Don't replace a time if one exists
if pd.isnull(
ol_running.loc[i]['Time'].tolist()):
ol_running.loc[i, 'Merged_name'] = name
ol_running.loc[i, 'Ratio'] = best_match[1]
latest_race_date = group_2.loc[(group_2[
'Name'] == name, 'Date')].max()
ol_running.loc[i, 'Time'] = group_2.loc[
(group_2['Name'] == name) &
(group_2['Date'] == latest_race_date)][
'Time'].tolist()[0]
event_categories = ['Track_Flat', 'Road']
merge_times_ext(top_running, event_categories)
ol_running.loc[~ol_running['Time'].isnull()].head()
ol_running.loc[~ol_running['Time'].isnull()].info()
Merging the second data set in has increased the number of rows with a time by a few hundred.
To investigate this, plot athletes' results (i.e. times) against the date of the performance. This will be done individually for each event and separately for each gender. The plots use colour to identify Olympic medal-winning performances (see the key). Both the ol_tf_running and top_running data sets are used for this. The analysis follows in section 5.1.
# Group by gender
top_running_gender_groups = top_running.groupby(['Gender', 'Top 10'])
top_running_m = top_running_gender_groups.get_group(('M', True))
top_running_f = top_running_gender_groups.get_group(('F', True))
ol_tf_running_gender_groups = ol_tf_running.groupby('Gender')
ol_tf_running_m = ol_tf_running_gender_groups.get_group('M')
ol_tf_running_f = ol_tf_running_gender_groups.get_group('F')
def build_graph_labels(gender, category, event, characteristic=None):
"""
build_graph_labels
Helper function to create strings to use in constructing the graph title
Input parameters:
gender - athlete gender group
category - type of event
event - specific distance
characteristic - athlete characteristic, default None
Returns:
gender_label - Readable gender string
event_label - Readable event name string
category_label - Readable event category string
unit - Unit for the characteristic to plot
"""
if gender == 'M':
gender_label = "Male"
else:
gender_label = "Female"
if category == 'Road':
if event == 42195:
event_label = 'Marathon'
if event == 21098:
event_label = 'Half Marathon'
category_label = 'Road Running'
if category == 'Track_Flat':
event_label = str(event)+'m'
category_label = 'Track (Flat)'
if category == 'Hurdles':
event_label = str(event)+'m'+' Hurdles'
category_label = 'Hurdles'
if category == 'Steeplechase':
event_label = str(event)+'m'+' Steeplechase'
category_label = 'Steeplechase'
if characteristic == 'Height':
unit = 'cm'
elif characteristic == 'Weight':
unit = 'kg'
elif characteristic == 'Age':
unit = 'years'
elif characteristic == 'BMI':
unit = 'kg/m^2'
else:
unit = None
return gender_label, event_label, category_label, unit
def update_min_max(x_series, y_series, x_min, x_max, y_min, y_max):
"""
Helper to update minimum and maximum values of the x and y series
Input parameters:
x_series - A datetime.date series
y_series - A datetime.time list
x_min - earliest date found so far
x_max - latest date found so far
y_min - shortest time found so far
y_max - longest time found so far
Returns:
x_min - Earliest date
x_max - Latest date
y_min - Smallest time
y_max - Largest time
"""
if x_series.min() < x_min:
x_min = x_series.min()
if x_series.max() > x_max:
x_max = x_series.max()
if min(y_series) < y_min:
y_min = min(y_series)
if max(y_series) > y_max:
y_max = max(y_series)
return x_min, x_max, y_min, y_max
def plot_times(event_categories, debug=False):
"""
Helper function to plot finish times for athletes across all events.
Input parameters:
event_categories - list of categories of events
debug - True/False flag switching debug on/off.
Returns:
None
"""
global graph_number
summary_strings = []
x_min = datetime(2020, 1, 1, 0, 0, 0, 0)
x_max = datetime(1896, 1, 1, 0, 0, 0, 0)
y_min = time(23, 0, 0)
y_max = time(0, 0, 0)
top_running_data_present = True
ol_tf_running_data_present = True
for category in event_categories:
events = ol_running[category].unique().tolist()
events.remove(0)
events.sort()
if category == 'Road':
events.append(21098)
for event in events:
if debug:
print("Category = {}, Event={}".format(category, event))
for gender in ol_tf_running_gender_groups.groups.keys():
if gender == 'M':
top_running_group = top_running_m
ol_tf_running_group = ol_tf_running_m
else:
top_running_group = top_running_f
ol_tf_running_group = ol_tf_running_f
plt.figure(figsize=(18, 9))
# Plot top running time data, if it exists
try:
x_series = top_running_group[
top_running_group[category] == event]['Date']
y_series = list(top_running_group[
top_running_group[category] == event]['Time'])
plt.scatter(x_series, y_series, color='b',
label='Top 10 results in year')
x_min, x_max, y_min, y_max = update_min_max(
x_series,
y_series,
x_min,
x_max,
y_min,
y_max)
except KeyError:
if debug:
print("No data from top running times for this event.")
top_running_data_present = False
# Plot each olympic medal colour, if data exists for this event
if ol_tf_running_group[(ol_tf_running_group[
category] == event)].shape[0] != 0:
x_series = pd.to_datetime(
ol_tf_running_group
[(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'G')]['Year'],
format='%Y')
y_series = list(ol_tf_running_group[
(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'G')]['Time'])
plt.scatter(x_series, y_series, color='gold',
label='Olympic gold medal')
x_min, x_max, y_min, y_max = update_min_max(
x_series,
y_series,
x_min,
x_max,
y_min,
y_max)
x_series = pd.to_datetime(ol_tf_running_group[
(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'S')]['Year'],
format='%Y')
y_series = list(ol_tf_running_group[
(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'S')]['Time'])
plt.scatter(x_series, y_series, color='silver',
label='Olympic silver medal')
x_min, x_max, y_min, y_max = update_min_max(
x_series,
y_series,
x_min,
x_max,
y_min,
y_max)
x_series = pd.to_datetime(ol_tf_running_group[
(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'B')]['Year'],
format='%Y')
y_series = list(ol_tf_running_group[
(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'B')]['Time'])
plt.scatter(x_series, y_series, color='brown',
label='Olympic bronze medal')
x_min, x_max, y_min, y_max = update_min_max(
x_series,
y_series,
x_min,
x_max,
y_min,
y_max)
else:
ol_tf_running_data_present = False
if debug:
print("No data from Olympic data set for this event.")
# Only plot if there is some data
if ((ol_tf_running_data_present is True) or
(top_running_data_present is True)):
# Construct the graph title and axis labels
(gender_label,
event_label,
category_label,
unit) = build_graph_labels(
gender,
category,
event)
plt.xlabel('Year')
plt.ylabel('Time')
plt.title("Graph {0}: Variation in {1} {2} Times"
.format(graph_number, gender_label, event_label))
plt.legend()
plt.show()
graph_number += 1
# Print information about variation in results in graph
# Calculate difference between quickest and slowest times
y_delta = datetime.combine(
date.today(), y_max) - datetime.combine(
date.today(), y_min)
# Calculate difference as a proportion of the fastest time
y_proportion = y_delta.total_seconds() / (
(datetime.combine(date.today(), y_min) -
datetime.combine(date.today(),
time(0, 0, 0))).total_seconds())
# Calculate length of time we have plotted data over
x_delta = x_max - x_min
summary_strings.append(
"Top times in {0} {1} span a range of {2} seconds ({3:.2f}%) in {4:.1f} years of history"
.format(
gender_label,
event_label,
y_delta.total_seconds(),
y_proportion*100,
x_delta.days/365))
print("Top times in {0} {1} span a range of {2} seconds ({3:.2f}%) in {4:.1f} years of history"
.format(
gender_label,
event_label,
y_delta.total_seconds(),
y_proportion*100,
x_delta.days/365))
# Reset
top_running_data_present = True
ol_tf_running_data_present = True
x_min = datetime(2020, 1, 1, 0, 0, 0, 0)
x_max = datetime(1896, 1, 1, 0, 0, 0, 0)
y_min = time(23, 0, 0)
y_max = time(0, 0, 0)
print("Summary")
for summary in summary_strings:
print(summary)
# A label for the graphs plotted
graph_number = 1
# Plot graphs for all events
event_categories = ['Track_Flat', 'Steeplechase', 'Hurdles', 'Road']
plot_times(event_categories)
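The summary line printed for each graph reduces to a small calculation: the gap between the slowest and fastest times, expressed in seconds and as a percentage of the fastest time. A minimal sketch of that arithmetic, using hypothetical helper names and made-up times, might look like this:

```python
from datetime import date, datetime, time


def seconds_since_midnight(t):
    """Convert a datetime.time to a number of seconds."""
    midnight = datetime.combine(date.min, time(0, 0, 0))
    return (datetime.combine(date.min, t) - midnight).total_seconds()


def spread_summary(times):
    """Return (range_in_seconds, range_as_percent_of_fastest_time)."""
    fastest = seconds_since_midnight(min(times))
    slowest = seconds_since_midnight(max(times))
    delta = slowest - fastest
    return delta, 100 * delta / fastest


# Made-up finishing times of 12 s, 10 s and 11 s
delta, percent = spread_summary([time(0, 0, 12), time(0, 0, 10), time(0, 0, 11)])
print("Range: {0} seconds ({1:.2f}%)".format(delta, percent))
```

Because the denominator is the fastest time, a large percentage does not mean times fell by that fraction; a value above 50% simply says the slowest time was more than one and a half times the fastest.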
The analysis for the question "How have athletes' performances changed through history?" can be found in section 5.1.
This will be examined by plotting the mean value of each of the four athlete characteristics (height, weight, age, BMI) against year of competition. Multiple events are plotted on the same axes for ease of comparison. The NumPy polyfit() function is used to plot a best fit line for each event.
The analysis can be found in section 5.2
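The best fit line is an ordinary least-squares straight line through the (year, mean value) points, fitted after discarding years with no data. The sketch below shows that calculation in plain Python, as an assumed equivalent of a degree-1 polyfit(); the function names and the sample numbers are invented for illustration.

```python
import math


def drop_nan_pairs(xs, ys):
    """Remove (x, y) pairs whose y is NaN, keeping the two lists aligned."""
    pairs = [(x, y) for x, y in zip(xs, ys) if not math.isnan(y)]
    return [x for x, _ in pairs], [y for _, y in pairs]


def linear_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up mean heights (cm) with two missing years
years = [2000, 2004, 2008, 2012]
mean_heights = [180.0, float("nan"), float("nan"), 186.0]
clean_years, clean_heights = drop_nan_pairs(years, mean_heights)
slope, intercept = linear_fit(clean_years, clean_heights)
print("Trend: {0:.2f} cm per year".format(slope))
```

Filtering the pairs in one pass avoids deleting items from two lists by position, which is easy to get wrong when several missing values sit next to each other.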
def plot_athlete_characteristics(event_categories,
characteristics,
debug=False):
"""
Helper function to plot athlete characteristics
(height, weight, age, BMI) against year of competition.
Multiple events are plotted on the same axis for comparison.
A best fit line is added for each event in the plot.
Input parameters:
event_categories - list of categories of events
characteristics - list of athlete characteristics to plot
debug - True/False flag to switch debug on/off.
Returns:
None
"""
meanvals = []
years = []
global graph_number
for c in characteristics:
for category in event_categories:
if debug:
print(category)
ol_running_groups = ol_running.groupby([category, 'Gender'])
events = ol_running[category].unique().tolist()
events.remove(0)
for gender in ol_running['Gender'].unique().tolist():
plt.figure(figsize=(18, 9))
if debug:
print(gender)
for event in events:
if debug:
print(event)
try:
ol_running_group = ol_running_groups.get_group(
(event, gender))
except KeyError:
if debug:
print("No results in ol_running_groups")
continue
for year in ol_running_group['Year'].unique().tolist():
meanvals.append(ol_running_group[
ol_running_group['Year'] == year][c].mean())
years.append(year)
plt.scatter(years, meanvals, label=str(event)+'m')
# Remove NaN values for polyfit()
nullvals = np.isnan(meanvals)
# Pop from the end first so that earlier indices stay valid
for i in np.where(nullvals)[0][::-1]:
meanvals.pop(i)
years.pop(i)
z = np.polyfit(years, meanvals, 1)
p = np.poly1d(z)
plb.plot(years, p(years))
meanvals = []
years = []
# Construct the graph title and axis labels
(gender_label,
event_label,
category_label,
unit) = build_graph_labels(gender, category, event, c)
plt.xlabel('Year')
plt.ylabel(c + ' ({})'.format(unit))
plt.title("Graph {0}: Variation in Mean {1} of {2} {3} Olympic Athletes Through History"
.format(
graph_number,
c,
gender_label,
category_label))
plt.legend()
plt.show()
graph_number += 1
Calculate body mass index (BMI):
ol_running.insert(
loc=ol_running.columns.get_loc('Weight'), column='BMI', value=0)
ol_running['BMI'] = ol_running['Weight'] / ((ol_running['Height'] / 100)**2)
ol_running.head()
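As a quick check of the BMI formula, the same calculation for a single athlete can be written as a small function. This is only a worked example with a hypothetical function name and made-up values; the height is in centimetres, as in the data set, so it is converted to metres first.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2: weight divided by height (in metres) squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


# Made-up athlete: 81 kg and 180 cm
print("BMI = {0:.1f} kg/m^2".format(bmi(81, 180)))
```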
characteristics = ['Height', 'Weight', 'Age', 'BMI']
plot_athlete_characteristics(event_categories, characteristics)
The analysis can be found in section 5.2
This will be examined by plotting each of the four athlete characteristics against their performance (i.e., time) for each event and gender. Colours are used to indicate medal-winning performances as indicated by the key.
The analysis can be found in section 5.3
ol_running_gender_groups = ol_running.groupby('Gender')
ol_running_m = ol_running_gender_groups.get_group('M')
ol_running_f = ol_running_gender_groups.get_group('F')
def plot_time_vs_characteristics(event_categories,
characteristics,
debug=False):
"""
Helper function to plot athlete characteristics
(height, weight, age, BMI) against time.
Input parameters:
event_categories - list of categories of events
characteristics - list of athlete characteristics to plot
debug - True/False flag to switch debug on/off.
Returns:
None
"""
global graph_number
data_present = True
for c in characteristics:
for category in event_categories:
events = ol_running[category].unique().tolist()
events.remove(0)
events.sort()
for event in events:
if debug:
print("Category = {}, Event={}".format(category, event))
for gender in ol_running_gender_groups.groups.keys():
if gender == 'M':
ol_running_group = ol_running_m
else:
ol_running_group = ol_running_f
plt.figure(figsize=(18, 9))
# Plot each olympic medal colour, if data exists
if ol_running_group[(ol_running_group[
category] == event)].shape[0] != 0:
# Plot vertical lines showing mean, std dev
meanval = ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Time'].notna())][c].mean()
stddev = ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Time'].notna())][c].std()
representative_max_time = max(
list(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'B') &
(ol_running_group['Time'].notna())]['Time']))
representative_min_time = min(
list(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'B') &
(ol_running_group['Time'].notna())]['Time']))
plt.plot([meanval, meanval],
[representative_min_time,
representative_max_time],
color='red',
label='Mean value of characteristic')
plt.plot([meanval + stddev,
meanval + stddev],
[representative_min_time,
representative_max_time],
color='pink',
label='Mean +/- standard deviation')
plt.plot([meanval - stddev,
meanval - stddev],
[representative_min_time,
representative_max_time],
color='pink',
label='Mean +/- standard deviation')
plt.scatter(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'G') &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'G') &
(ol_running_group['Time'].notna())]['Time']),
color='gold',
label='Olympic gold medal')
plt.scatter(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'S') &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'S') &
(ol_running_group['Time'].notna())]['Time']),
color='silver',
label='Olympic silver medal')
plt.scatter(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'B') &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'B') &
(ol_running_group['Time'].notna())]['Time']),
color='brown',
label='Olympic bronze medal')
plt.scatter(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Medal'].isnull()) &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Medal'].isnull()) &
(ol_running_group['Time'].notna())]['Time']),
color='blue',
label='No medal')
else:
data_present = False
if debug:
print("No data from Olympic data set for this event.")
# Only plot if there is some data
if data_present is True:
# Construct the graph title and axis labels
(gender_label,
event_label,
category_label,
unit) = build_graph_labels(
gender,
category,
event,
c)
plt.xlabel(c + ' ({})'.format(unit))
plt.ylabel('Time')
plt.title("Graph {0}: Variation in {1} {2} Times with {3}"
.format(
graph_number,
gender_label,
event_label,
c))
plt.legend()
plt.show()
graph_number += 1
data_present = True
# Plot graphs for all events
event_categories = ['Track_Flat', 'Steeplechase', 'Hurdles', 'Road']
plot_time_vs_characteristics(event_categories, characteristics)
Now display the same set of results, but colour coded to show the 20-year time period they fall into. This is to look for any relationship between the characteristic and the results in a particular year group.
# Create a set of year groups throughout Olympic history
year_groups = np.arange(1896, 2020, 20)
year_groups[-1] += 1 # To include 2016 Games
year_groups
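To make the binning explicit, the sketch below rebuilds the same group edges with plain Python and maps a year to the label used in the plot legends. The function names are hypothetical; the edges and the half-open comparison (lower edge included, upper edge excluded) follow the filtering used in the plotting function.

```python
def make_year_groups(start=1896, stop=2020, step=20):
    """Return bin edges every `step` years, widening the last bin by one year."""
    edges = list(range(start, stop, step))
    edges[-1] += 1  # So that the 2016 Games fall inside the final group
    return edges


def year_group_label(year, edges):
    """Return a label such as '1996 to 2016', or None if the year is outside."""
    for lower, upper in zip(edges, edges[1:]):
        if lower <= year < upper:
            return "{0} to {1}".format(lower, upper - 1)
    return None


edges = make_year_groups()
print(edges)
print(year_group_label(2016, edges))
```

Without the one-year adjustment to the last edge, the half-open comparison would leave the 2016 Games out of every group.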
def plot_time_vs_characteristics_time_groups(event_categories,
characteristics,
debug=False):
"""
Helper function to plot athlete characteristics
(height, weight, age, BMI) against time.
This function uses colour codes to show the 20-year
time period into which a performance falls.
Input parameters:
event_categories - list of categories of events
characteristics - list of athlete characteristics to plot
debug - True/False flag to switch debug on/off.
Returns:
None
"""
global graph_number
data_present = True
for c in characteristics:
meanval_list_m = []
stddev_list_m = []
event_list_m = []
meanval_list_f = []
stddev_list_f = []
event_list_f = []
for category in event_categories:
events = ol_running[category].unique().tolist()
events.remove(0)
events.sort()
for event in events:
if debug:
print("Category = {}, Event={}".format(category, event))
for gender in ol_running_gender_groups.groups.keys():
if gender == 'M':
ol_running_group = ol_running_m
else:
ol_running_group = ol_running_f
plt.figure(figsize=(18, 9))
# Plot
if ol_running_group[(ol_running_group[
category] == event)].shape[0] != 0:
for y in range(len(year_groups)-1):
plt.scatter(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Year'] >= year_groups[y]) &
(ol_running_group['Year'] < year_groups[y+1]) &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Year']
>= year_groups[y]) &
(ol_running_group['Year']
< year_groups[y+1]) &
(ol_running_group[
'Time'].notna())]['Time']),
label='{0} to {1}'
.format(year_groups[y], year_groups[y+1]-1))
# Calculate mean value of the characteristic 1996-2016
meanval = ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Year'] >= year_groups[y]) &
(ol_running_group['Year'] < year_groups[y+1]) &
(ol_running_group['Time'].notna())][c].mean()
stddev = ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Year'] >= year_groups[y]) &
(ol_running_group['Year'] < year_groups[y+1]) &
(ol_running_group['Time'].notna())][c].std()
if gender == 'M':
if category == 'Track_Flat':
event_list_m.append(str(event)+'m')
if category == 'Hurdles':
event_list_m.append(str(event) + 'm Hurdles')
if category == 'Steeplechase':
event_list_m.append(str(event)
+ 'm Steeplechase')
if category == 'Road':
event_list_m.append('Marathon')
meanval_list_m.append(meanval)
stddev_list_m.append(stddev)
if gender == 'F':
if category == 'Track_Flat':
event_list_f.append(str(event)+'m')
if category == 'Hurdles':
event_list_f.append(str(event) + 'm Hurdles')
if category == 'Steeplechase':
event_list_f.append(
str(event) + 'm Steeplechase')
if category == 'Road':
event_list_f.append('Marathon')
meanval_list_f.append(meanval)
stddev_list_f.append(stddev)
representative_max_time = max(
list(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Year'] >= year_groups[y]) &
(ol_running_group['Year'] < year_groups[y+1]) &
(ol_running_group['Time'].notna())]['Time']))
representative_min_time = min(
list(ol_running_group[
(ol_running_group[category] == event) &
(ol_running_group['Year'] >= year_groups[y]) &
(ol_running_group['Year'] < year_groups[y+1]) &
(ol_running_group['Time'].notna())]['Time']))
plt.plot([meanval, meanval],
[representative_min_time,
representative_max_time],
color='brown',
label='Mean value of characteristic, 1996-2016')
plt.plot([meanval+stddev, meanval+stddev],
[representative_min_time,
representative_max_time],
color='brown',
label='Mean value +/- std dev, 1996-2016')
plt.plot([meanval-stddev, meanval-stddev],
[representative_min_time,
representative_max_time],
color='brown',
label='Mean value +/- std dev, 1996-2016')
else:
data_present = False
if debug:
print("No data from Olympic data set for event.")
# Only plot if there is some data
if data_present is True:
# Construct the graph title and axis labels
(gender_label,
event_label,
category_label,
unit) = build_graph_labels(
gender,
category,
event,
c)
plt.xlabel(c + ' ({})'.format(unit))
plt.ylabel('Time')
plt.title("Graph {0}: Variation in {1} {2} Times with {3}"
.format(
graph_number,
gender_label,
event_label,
c))
plt.legend()
plt.show()
graph_number += 1
data_present = True
# Plot the relationship between mean values of characteristic per event
plt.figure(figsize=(18, 9))
plt.errorbar(event_list_m, meanval_list_m, yerr=stddev_list_m, fmt='o',
label='Mean +/- Standard Deviation')
# Construct the graph title and axis labels
gender_label, event_label, category_label, unit = build_graph_labels(
'M',
category,
event,
c)
plt.xlabel('Event')
plt.ylabel(c + ' ' + unit)
plt.title("Graph {0}: Mean +/- Standard Deviation {1} of {2} Athletes per Event"
.format(graph_number, c, gender_label))
plt.legend()
graph_number += 1
plt.show()
plt.figure(figsize=(18, 9))
plt.errorbar(event_list_f, meanval_list_f, yerr=stddev_list_f, fmt='o',
label='Mean +/- Standard Deviation')
# Construct the graph title and axis labels
gender_label, event_label, category_label, unit = build_graph_labels(
'F',
category,
event,
c)
plt.xlabel('Event')
plt.ylabel(c + ' ' + unit)
plt.title("Graph {0}: Mean +/- Standard Deviation {1} for {2} Athletes per Event"
.format(graph_number, c, gender_label))
plt.legend()
graph_number += 1
plt.show()
plot_time_vs_characteristics_time_groups(event_categories, characteristics)
The analysis can be found in section 5.3
This is an analysis for the results in section 4.1
Graphs 1-24 show that top performances have improved for almost every event and gender combination. The only exception is the women's 3000m steeplechase, but that is because the data is very limited (spanning only 8 years). All other events show an improvement over time.
The biggest proportional improvement is 85% in the men's marathon. But compare this to the improvement in the men's half marathon: a much smaller 4.2%. This illustrates an important point: the men's marathon has 121 years of data, but the half marathon has only the last 31 years of data. The difference in the improvement for the men's marathon and the half marathon is partly because the rate of improvement for every event was much steeper prior to about 1960. Since then, results have improved much more slowly. This can also be seen in the shape of the graphs, which often have a shape like an exponential decay curve.
Many of the men's events have improved in the range of 20-30% throughout the data set we have available. There is no clear relationship between the distance run and the proportional improvement. For example, two of the sprint events (100m and 200m) have improved by 31.5% and 19.3% respectively, and two of the longer track events (5000m and 10000m) have improved by 20.6% and 29.0% respectively. So the range of improvement is similar at both ends of the track running distances.
The decrease in the rate of improvement since about 1960 also explains why the proportional improvement in women's events is often less than that of the equivalent men's event. Women have only been allowed to participate at all distances relatively recently, so there is much less data available and it does not stretch back in time as far as for the men's events. However, where the length of history is comparable for each gender, it's clear that women and men are improving at a comparable rate. A good example is the half marathon, which has 34.5 years of history for the women and about 31 years for the men. The proportional improvements are similar: 7.37% for women, 4.22% for men.
Many of the women's events have improved in the range of 5-20%. The events with the longest history tend to be the ones which show the largest improvement, for the reasons already discussed.
This is the analysis for the results in section 4.2. Multiple events are shown on the same plot, and a best fit trend line is shown for each event.
For the track (flat) events, both male and female athletes show a similar relationship between the event distance and athlete height.
All the female flat track athletes have got slightly taller, with 1500m showing the biggest increase.
For men, 400m, 200m and 800m athletes have tended to get taller. 100m athletes have got shorter. For the other distances, the data doesn't go back very far, so from the data available these remaining heights seem unchanged through history.
There is insufficient data to draw any conclusions about the women's steeplechase. The men's event shows a clear increase in height.
Mean height increased for both women and men in all hurdles events. For men, the longer distance hurdlers (400m) tend to be taller than the shorter distances, but the reverse is true for women. The hurdlers tend to be slightly taller than athletes at their flat track equivalent distances.
Male marathoners show a very slight tendency to increase in mean height over time, and female marathoners show no change (though with a small range of available data). The marathon runners of both genders have similar height to the 10000m track runners.
For both male and female athletes there is a relationship between weight and distance of the event.
There is insufficient data to draw any conclusions about the women's steeplechase. There is no change shown in mean weight of male athletes, and this is at a similar level to male steeplechase or 5000m runners.
Mean weight increased for both women and men in all hurdles events. For men, the short distance hurdlers tend to be heavier than the longer distance hurdlers, but the reverse is true for women. The hurdlers tend to be heavier than athletes at their flat track equivalent distances, except for female 400m hurdlers and flat runners, who have similar mean weights.
The male and female marathoners have similar weights to the equivalent 5000m and 10000m athletes. The mean weights for both remain fairly constant through history.
(Side note about the men's results (Graph 40): there is one suspicious-looking point for the 1896 Games, of an anomalously heavy athlete. This is indeed a valid data point, a competitor who weighed 106 kg. He finished 6th out of 17 athletes in the 1896 Olympic marathon, although sadly we do not know his time. He must have been quite an unusual long distance athlete with this weight. Source: https://en.wikipedia.org/wiki/Dimitrios_Deligiannis.)
For every event and gender, mean age of the competitors has increased through history.
For both genders, there is a general trend that runners of the longer distances tend to have a higher age than those of the shorter distances.
There is insufficient data to draw any conclusions about the women's steeplechase. The mean age of the male competitors is similar to that of the 5000m flat track athletes and has stayed fairly constant.
The mean ages for both distances of hurdling, and for both genders, are similar, and also similar to the equivalent flat track distances.
The mean age of marathon runners of both genders is slightly higher than that of any of the long distance track athletes.
There is a very clear link between the distance of the race and mean athlete BMI. The shorter the distance, the higher the BMI. This is true for both genders. In addition, Graphs 49 and 50 show an approximately diverging set of lines, meaning that the mean BMI of shorter distance athletes tends to increase through history, whilst it tends to decrease for the longer distance events. The separation between events where BMI is increasing and decreasing comes between 400m and 800m.
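The diverging trends described above can be checked numerically. The following is a minimal sketch rather than the code used to produce Graphs 49 and 50. It assumes a DataFrame with hypothetical columns `Event`, `Year`, `Height` (in cm) and `Weight` (in kg), computes BMI as weight divided by height in metres squared, and fits a straight line to the mean BMI at each Games for each event. The small sample frame is invented purely for illustration.

```python
import numpy as np
import pandas as pd


def bmi_trend_per_event(df):
    """Return the slope (BMI units per year) of mean BMI against year for each event."""
    df = df.dropna(subset=["Height", "Weight"]).copy()
    # BMI = weight (kg) / height (m)^2; Height is assumed to be in cm
    df["BMI"] = df["Weight"] / (df["Height"] / 100) ** 2
    # Mean BMI of all competitors at each Games, per event
    means = df.groupby(["Event", "Year"])["BMI"].mean().reset_index()
    slopes = {}
    for event, grp in means.groupby("Event"):
        if len(grp) >= 2:  # need at least two Games to fit a line
            slope, _intercept = np.polyfit(grp["Year"], grp["BMI"], 1)
            slopes[event] = slope
    return pd.Series(slopes, name="bmi_slope_per_year")


# Invented illustrative data, not taken from the Olympic data set
sample = pd.DataFrame({
    "Event": ["100m"] * 4 + ["10000m"] * 4,
    "Year": [1960, 1960, 2000, 2000] * 2,
    "Height": [180, 180, 180, 180, 170, 170, 170, 170],
    "Weight": [70, 72, 78, 80, 60, 62, 56, 58],
})
trends = bmi_trend_per_event(sample)
print(trends)
```

A positive slope corresponds to a line rising through history and a negative slope to a falling one, so sorting the slopes by race distance would show where the sign changes.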
There is insufficient data to draw any conclusions about the women's steeplechase. For men, mean BMI is tending to decrease through history, and is similar to the mean BMI of 5000m athletes.
Mean BMI has tended to increase for all events and genders except male 400m hurdles, where it has stayed fairly constant. Female 100m hurdlers have similar mean BMI to female 100m flat track runners, but female 400m hurdlers have a lower BMI than female 400m flat track runners. Male hurdlers have very similar mean BMI to the equivalent male flat track runners.
Mean BMIs have tended to show a slight decrease through history. The mean BMIs of marathoners of both genders are similar to, or slightly higher than, the mean BMI of 10000m and 5000m track runners.
(Side note: Graph 56 shows the same anomalous (but real) data point for the result in 1896. See the explanation in section 5.2.2.4.)
This is the analysis for the results in section 4.3.
The previous section has already shown that athletes from a particular event tend to have common characteristics (for example: the shorter the distance, the higher the mean BMI). This set of graphs looks more closely at Olympic athletes for a particular event (rather than across the whole population of athletes in all running events). The aim is to look for any relationship between any of the four athlete characteristics and their performance within these very select groups. For example: do we find that, of all the athletes entered for the 100m, the heaviest athlete has a higher chance of winning? This is just a possibility to illustrate the kind of relationships we're looking for.
For height, most of the results tend to cluster near the mean height (bell curve shape). This includes both good and less good results, medal winners and non-medal winners, so really this is just reflecting the distribution of heights of the athletes for a particular event - in other words, there are more athletes with heights close to the mean for that particular event.
Results versus weights also often cluster near the mean in a bell curve shape. There are some cases where the curve is more skewed. For example, Graph 80 shows more results above the mean value than below, but fewer and more consistently good results above the mean value of weight. This reflects the fact observed in the previous section that short distance athletes have tended to get heavier through history, and this has accompanied the improvement in performance.
For age, the results also tend to cluster about the mean. However, as with weight, the results sometimes follow a more skewed distribution, with the largest number of results below the mean age, and more consistently good results at higher age. This is clearer in the cases where more data points are available. This also partly reflects the changes in athlete population: section 5.2.3. noted that athletes at all events have tended to get older through history. We can observe that as mean athlete age has increased, athlete performance has also improved.
The BMI graphs have slightly different distributions for short distance, middle distance and long distance events.
This is consistent with section 5.2.4., which showed that BMIs for short distances have tended to increase, for middle distances they have stayed fairly constant, and for long distances they have tended to decrease. This has accompanied an improvement in results for all events.
In summary, these graphs all tend to show a normal distribution or skewed normal distribution. Beyond that, there is no clear trend in the time versus characteristic graphs, when we focus narrowly on the group of athletes competing in a particular event.
This analysis aims to isolate the effect of the four athlete characteristics and see what relationship they have with performance. However, it's obvious that there are many other factors that have changed through history that could also affect performance. For example, general health and nutrition have improved and training facilities have got better. To mitigate these effects, graphs 145-232 plot the same results as in graphs 57-144, but using colour coding to show when those performances were. This allows an easier comparison between results, and in particular, the most modern results. These graphs all show the mean and standard deviation of the characteristic being plotted for athletes in the period 1996-2016 (rather than for all athletes in the results set).
The results from 1996-2016 occupy a much narrower band of times, from the minimum to the maximum time, than the full results set. It's clear that the best performances in the 1996-2016 period are spread across a fairly wide range of values of each characteristic. As with graphs 57-144, there is no clear trend in these graphs. Top performances come from athletes within 1-2 standard deviations of the mean of any particular characteristic.
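One way to check the "within 1-2 standard deviations" observation numerically is to express each top performer's characteristic as a z-score against the 1996-2016 mean and standard deviation. This is a minimal sketch under stated assumptions, not the notebook's own method: the column names `Year`, `Time` (seconds) and `Weight`, the choice of the three fastest results, and the sample rows are all hypothetical.

```python
import pandas as pd


def z_scores_of_top_results(df, characteristic, top_n=3, start=1996, end=2016):
    """Z-scores of `characteristic` for the fastest `top_n` results in the period."""
    modern = df[(df["Year"] >= start) & (df["Year"] <= end)]
    modern = modern.dropna(subset=[characteristic, "Time"])
    mean = modern[characteristic].mean()
    std = modern[characteristic].std()  # sample standard deviation (ddof=1)
    fastest = modern.nsmallest(top_n, "Time")
    return (fastest[characteristic] - mean) / std


# Invented results for a single event; the 1992 row falls outside the period
sample = pd.DataFrame({
    "Year": [1992, 2000, 2004, 2008, 2012, 2016, 2016],
    "Time": [9.7, 9.8, 9.9, 10.0, 10.1, 10.2, 10.3],
    "Weight": [100, 70, 90, 80, 78, 82, 80],
})
z = z_scores_of_top_results(sample, "Weight")
print(z.round(2))
```

An absolute z-score below 2 means that the athlete's characteristic lies within two standard deviations of the mean for the period.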
Graphs 167, 168, 191, 192, 215, 216, 239 and 240 show the relationship between the mean and standard deviation of the characteristic for the athletes competing between 1996 and 2016 for each event. The previous sections (5.3.1. and 5.3.2.) showed that athletes within 1-2 standard deviations of the mean for Olympic athletes in that event have the potential to deliver top performances. These graphs are another way of examining how critical each characteristic is for each event.
For both men and women, the mean heights in a significant number of events fall within one standard deviation of one another. Competitors in 400m flat and both hurdles events tend to be taller, the long distance competitors tend to be the shortest, and the remaining events fall in between and have significant overlap. A relatively short Olympic sprinter would be a similar height to a relatively tall Olympic marathon runner. The range of heights for women's 100m hurdles looks anomalously low, though this could be due to the relatively small amount of data for female athletes in this event.
Mean weights for athletes in different events are more widely separated. The mean weight of the long distance runners is well separated from that of the short distance sprinters and does not even overlap much with the mean weights of middle distance athletes. This distinction is clear for both genders but for men in particular. So an athlete with a weight near that of a sprinter is unlikely to perform well in a marathon.
For both men and women, the standard deviation age ranges overlap for every event.
The distribution is similar to that of mean weights. The mean BMIs are quite well separated, with the longer distance athletes having lower BMI, and short distance athletes having a higher BMI. A sprinter's BMI is not suitable for a marathon runner, and vice versa. Female hurdlers are an interesting case. Female 100m hurdlers tend to be shorter than female 400m hurdlers, but they have a noticeably higher BMI.
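The comparisons above, of whether the mean and standard deviation ranges of different events overlap, can also be made without reading them off a graph. The sketch below is illustrative only: it assumes a DataFrame with a hypothetical `Event` column plus one column per characteristic, takes the range for each event as the mean plus or minus one sample standard deviation, and tests two events for overlap. The sample values are invented.

```python
import pandas as pd


def one_sd_ranges(df, characteristic):
    """Mean minus/plus one standard deviation of `characteristic` for each event."""
    stats = df.groupby("Event")[characteristic].agg(["mean", "std"])
    stats["low"] = stats["mean"] - stats["std"]
    stats["high"] = stats["mean"] + stats["std"]
    return stats


def ranges_overlap(stats, event_a, event_b):
    """True if the one-standard-deviation ranges of the two events intersect."""
    a, b = stats.loc[event_a], stats.loc[event_b]
    return bool(a["low"] <= b["high"] and b["low"] <= a["high"])


# Invented illustrative data, not taken from the Olympic data set
sample = pd.DataFrame({
    "Event": ["100m"] * 3 + ["Marathon"] * 3,
    "Weight": [74, 78, 82, 56, 60, 64],
    "Age": [23, 25, 27, 26, 28, 30],
})
weight_stats = one_sd_ranges(sample, "Weight")
age_stats = one_sd_ranges(sample, "Age")
print(ranges_overlap(weight_stats, "100m", "Marathon"))
print(ranges_overlap(age_stats, "100m", "Marathon"))
```

In this invented sample the weight ranges are separated while the age ranges overlap, which mirrors the pattern described for the real data.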
Performances have improved across all events and genders. This reflects improvements in training techniques, general health, facilities and the application of science to sport. The rate of improvement has slowed down since about 1960, which is probably due to diminishing returns as athletes and coaches get better.
Men's performances have generally improved in the range of 20-30%. Women's performances have generally improved in the range of 5-20%. However, this difference stems largely from the fact that women have only been able to compete at all distances for a relatively short time, mostly after 1960, which is when the rate of improvement has generally been slower in all events. When there is a comparable data history for both men and women in the same event (e.g., half marathon) then the improvements in both the men's and women's events are similar.
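For reference, a percentage improvement of this kind can be computed as the reduction in best time between the earliest and latest Games, relative to the earliest. This is one plausible definition, assumed here for illustration, and may not match exactly how the figures above were derived. The `Year` and `Time` (seconds) column names and the sample values are hypothetical.

```python
import pandas as pd


def percent_improvement(results):
    """Percentage reduction in best time between the earliest and latest Games."""
    # Best (lowest) time recorded at each Games, in chronological order
    best = results.groupby("Year")["Time"].min().sort_index()
    first, latest = best.iloc[0], best.iloc[-1]
    return (first - latest) / first * 100


# Invented times in seconds for a single event
sample = pd.DataFrame({
    "Year": [1900, 1900, 1960, 2016, 2016],
    "Time": [200.0, 210.0, 170.0, 150.0, 155.0],
})
improvement = percent_improvement(sample)
print(f"{improvement:.1f}%")
```

Because the result depends on the first year for which data exists, events with a shorter history will tend to show a smaller improvement, which is the effect noted above for the women's events.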
It's clear that athlete characteristics are important in determining which event an athlete will compete in: an athlete with a particular combination of height, weight, age and BMI (by implication) will be suited to particular events and less suited to others. This shows there is a relationship between athlete characteristics and performance.
This analysis also looked at single events in isolation and looked at the athletes competing in that event. It showed that although the characteristics of athletes for a particular event cluster about a mean, and the combination is characteristic of that event, athletes with a wide range of individual height, weight, BMI and age still produce top performances. The top performances do not cluster very closely to the mean - some top performances fall 1-2 standard deviations away from the mean.
The practical applications of these results for athletes and coaches are:
Many factors influence athletes' performance, and this study has looked at just a few of them. A more extensive study would examine the effect of factors such as training schedules (e.g., distance run per week, techniques such as weight training), diet, and athlete characteristics such as VO2 max.
This analysis is published on GitHub (https://github.com/mattjezza/ds-proj1-t2-elite-athletics) and summarised in a post on Medium titled "Analysis of Elite Running Performance using Historical Data".